Schedule

8:00   Which formula shall I use?
8:30   Choosing the right formula: Estimation
9:15   Break
9:45   Practical example: Tasiburn Road
10:30  Choosing the right formula: Decision Making
11:15  Did I take enough samples?
11:30  Conclude

Purpose  of  the  presentation 


•  To  better  understand  what  formula  to  use  and  what 
assumptions  are  required 

•  To  gain  an  understanding  of  how  sample  size  can 
change  depending  on  the  purpose  of  the  project 

•  To  know  if  (probably)  enough  samples  have  been 
taken

Which Formula Should I use?


Ask  the  statistician:  How  many  samples? 

Me  (Statistician):  OK,  how  close  do  you  need  the  answer  to  be? 
You (Investigator): Maybe within +/- 10ppm

How  sure  do  you  need  to  be? 

Pretty  sure,  is  that  the  95%  thing? 

You  are  talking  about  the  mean  aren’t  you? 

I  suppose  so 

What’s  the  variability  of  the  population,  i.e.,  variance? 

Don’t  know 

It  is  a  homogeneous  population  isn’t  it? 

I  guess 

Can  I  assume  Normality  of  the  population? 

Isn’t  that  usual? 

What  power  do  you  need? 

Huh? I’ll do it myself!

Look in a Textbook


From one standard textbook:

$$n = \frac{(z_{1-\alpha/2})^2 \, \sigma^2}{d^2}$$

From one EPA document:

$$n = \frac{(z_{1-\alpha} + z_{1-\beta})^2 \, \sigma^2}{d^2}$$

From another document:

$$n = \frac{(z_{1-\alpha} + z_{1-\beta})^2 \, s^2}{d^2} + \tfrac{1}{2} z_{1-\alpha}^2$$

From yet another:

$$n = \frac{(z_{1-\alpha/2})^2 \, s^2}{d^2} + \tfrac{1}{2} z_{1-\alpha/2}^2$$

...but what do the terms mean and which one should I use?

Explaining the terms used

There are clearly some similarities:

“d” = How close the estimate needs to be

“σ²” = The variance of a population

“s²” = The variance of a sample

“z” = Relates to the Normal distribution

“1 - α” = How much certainty and related to significance level

“1 - β” = How much certainty and related to statistical power

...but it is still not clear what these terms really mean...

What do we know about these terms?

“d” = OK, this we can determine, the part of estimation
“σ²” = The variance of a population (how do we know this?)
“s²” = The variance of a sample (what sample?)

“z” = Relates to the Normal distribution (i.e., the bell curve)
“1 - α” = Certainty & significance level (is this 95%?)

“1 - β” = Certainty & statistical power (not sure about this)
...but it is still not clear where we get these things...


Sorting out the terms: z_{1-α}

This is the z-value (Normal, bell-shaped curve) that has
1 - α in the “body” of the curve, and α in the “tail”

Level of Significance    α        z_{1-α}
10%                      0.10     1.28
5%                       0.05     1.645
2.5%                     0.025    1.96
1%                       0.01     2.33

[Figure: Normal curve showing area 1 - α in the body and α in the tail]
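
The tabulated z-values can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper name `z_value` is introduced here for illustration and does not come from the presentation.

```python
from statistics import NormalDist


def z_value(alpha: float) -> float:
    """Return z_{1-alpha}: the standard Normal quantile leaving area alpha in the upper tail."""
    return NormalDist().inv_cdf(1.0 - alpha)


# Print the quantiles for the significance levels listed in the table.
for alpha in (0.10, 0.05, 0.025, 0.01):
    print(f"alpha = {alpha:<5}  z = {z_value(alpha):.3f}")
```

The printed values agree with the table to the precision shown there.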
Sorting out the terms: z_{1-β}

• This is the same idea as shown with z_{1-α} but is only used
when the sample is going to be used for decision making.

• It is related to the statistical power of a specified test
(e.g., Student’s t-test). Power = 1 - β

• β is the chance of getting it wrong when the Null
Hypothesis is false and the Alternative Hypothesis is true.

• You don’t need this term if you are doing estimation.


Sorting out the terms: σ²


This is the variability of the data as defined by the
variance (σ²)

It  is  very  rare  that  we  know  this  value  although  we 
can  try  and  estimate  it  by  learning  from  similar 
projects 

Sometimes  we  can  get  an  estimate  by  making 
assumptions  about  the  project 

The  more  we  can  assume,  the  better  the  estimate 
of  variability 


2011 EMDQ Workshop - Arlington, VA


How  Many  Samples  Are  Needed? 


March  29,  2011 


Assumptions  and  estimating  sigma 


We can obtain a rough estimate of sigma (from which
we then get σ²) using the maximum and minimum
known data values (the Range of the data). Divide the
Range by 6 to obtain estimated sigma.


Assumption     Justification          Risk of error
None           Chebychev              0.111
Data > 0       Cantelli               0.100
Unimodal       Vysochanski-Petunin    0.049
Bell-shaped    Normal                 0.003

Range  divided  by  6  entails  some  risk 


•  If  the  maximum  is  really  “way  off  the  scale”  then 
the  estimate  of  sigma  will  be  too  high  and  so  the 
number  of  samples  needed  will  also  be  high. 

• Conversely, if we underestimate sigma we will
think we’re doing better than we really are as we
haven’t collected enough samples.


Assumption     Risk     Consequence
None           0.111    Probably sigma underestimated
Data > 0       0.100    Probably sigma underestimated
Unimodal       0.049    Likely a pretty good estimate
Bell-shaped    0.003    Definitely a good estimate

The formulae and the assumptions

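• No Assumptions: Chebychev’s Inequality

$$\mathrm{Prob}(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}$$

• All data > 0: Cantelli’s Inequality

$$\mathrm{Prob}(X - \mu \ge k\sigma) \le \frac{1}{1 + k^2}$$

• Unimodal: Vysochanski-Petunin Inequality

$$\mathrm{Prob}(|X - \mu| \ge k\sigma) \le \frac{4}{9k^2} \approx \frac{0.444}{k^2} \quad \text{for } k > 1.633$$

• Bell-shaped: Normal Theory

$$\mathrm{Prob}(|X - \mu| > 3\sigma) = 0.003$$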
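
The "risk of error" figures quoted for the Range / 6 rule correspond to these bounds evaluated at k = 3 (the range is treated as spanning 3 sigma either side of the mean). A minimal Python sketch, with function names chosen here purely for illustration:

```python
import math


def chebyshev_bound(k: float) -> float:
    """Two-sided bound with no distributional assumptions: 1 / k^2."""
    return 1.0 / k**2


def cantelli_bound(k: float) -> float:
    """One-sided bound: 1 / (1 + k^2)."""
    return 1.0 / (1.0 + k**2)


def vysochanskij_petunin_bound(k: float) -> float:
    """Two-sided bound for unimodal distributions, valid only for k > sqrt(8/3)."""
    if k <= math.sqrt(8.0 / 3.0):
        raise ValueError("bound only applies for k > sqrt(8/3), about 1.633")
    return 4.0 / (9.0 * k**2)


def normal_two_sided_tail(k: float) -> float:
    """Exact probability that a Normal variable lies more than k sigma from its mean."""
    return math.erfc(k / math.sqrt(2.0))


k = 3.0
for name, value in [
    ("Chebychev", chebyshev_bound(k)),
    ("Cantelli", cantelli_bound(k)),
    ("Vysochanski-Petunin", vysochanskij_petunin_bound(k)),
    ("Normal", normal_two_sided_tail(k)),
]:
    print(f"{name:<20} {value:.3f}")
```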


Now  that  we  know  something  about 
the  inputs.... 


From one standard textbook:

$$n = \frac{(z_{1-\alpha/2})^2 \, \sigma^2}{d^2}$$

From one EPA document:

$$n = \frac{(z_{1-\alpha} + z_{1-\beta})^2 \, \sigma^2}{d^2}$$

From another document:

$$n = \frac{(z_{1-\alpha} + z_{1-\beta})^2 \, s^2}{d^2} + \tfrac{1}{2} z_{1-\alpha}^2$$

From yet another:

$$n = \frac{(z_{1-\alpha/2})^2 \, s^2}{d^2} + \tfrac{1}{2} z_{1-\alpha/2}^2$$

... which formula should we use?

Choosing the Right Formula
(Estimation) 


Example:  Field  sprayed  with  pesticide 


•  Ferris  Field  is  almost  rectangular,  roughly  200  x  150  yds 
(270,000  square  feet,  about  6  acres  or  roughly  20  houses). 

•  Several  years  ago  it  was  sprayed  with  a  pesticide  now 
regarded  as  hazardous.  We  need  to  estimate  the  average 
level  of  pesticide  present  in  the  top  6"  of  soil  at  Ferris 
Field. 

•  Reports  from  the  owner  of  the  site  (who  wishes  to 
redevelop  Ferris  Field  for  residential  use)  indicate  values 
as  high  as  90ppb  have  been  recorded.  Measurements  can 
be  taken  accurately  down  to  0.2ppb. 

•  The  long  term  exposure  level  is  50ppb. 

•  How  many  samples  should  we  take? 

Do we have sufficient information?


No


Could we make some typical scenarios?


Yes 

Make  some  reasonable  assumptions  and  see  what  happens 

1.  What  are  we  asked  for?  “...estimate  the  average  level...  ” 
Good,  we’re  talking  Mean  and  no  decision-making  needed. 

2.  Do  we  know  the  standard  deviation?  “...as  high  as  90ppb...  ” 
OK,  the  maximum  seen  is  90  and  if  we  assume  the  pesticide 
doesn’t  occur  naturally,  then  0  must  be  the  minimum. 

3.  Within  how  much  should  the  estimate  be?  "...exposure  level 
is  50ppb...  ”  OK,  let’s  try  +/- 10%  to  start  with  i.e.,  +/-5ppb. 

4. How sure do they want to be? "..."

They didn’t say so let’s try 95% to start with.

What formula should we choose?


• We should discard all those that demand the population
standard deviation (σ) as we don’t know it; we have to
estimate it from the available data.


•  How  shall  we  estimate  it?  From  the  maximum  and 
minimum  but  it  depends  on  assumptions: 


Assumption     Justification          Risk of error
None           Chebychev              0.111
Data > 0       Cantelli               0.100
Unimodal       Vysochanski-Petunin    0.049
Bell-shaped    Normal                 0.003


Which assumption is most likely to hold?


Consider  the  problem  again 


•  Ferris  Field  is  very  large  and  was  sprayed  with  a  pesticide 
sometime  in  the  past. 

• It’s unlikely it was pure pesticide but probably mixed with
something - could be water, we don’t know.

•  Fairly  sure  that  it  would  have  been  reasonably  well  mixed 
although  there  could  be  some  variations  in  strength. 

•  Have  no  details  on  how  the  pesticide  was  applied  so  some 
areas  may  have  been  sprayed  twice. 

• There could be some “hot-spots” but more likely a
reasonably uniform contamination with high values, some
medium values, and low values.

We conclude it’s probably unimodal


•  If  it  is  unimodal  (some  kind  of  distribution  with  a 
maximum  but  we  don’t  know  the  shape)  then  taking  the 
Range  divided  by  6  will  give  us  a  reasonable  estimate  of 
sigma.  The  Vysochanski-Petunin  inequality  tells  us 
there’s  only  a  5%  chance  we’re  underestimating  sigma. 

• If we could assume it’s Bell-shaped (Normal) then there
would be only about a 0.3% chance of underestimating sigma
but this would be pushing it!

• Maximum was 90ppb, Minimum has to be 0ppb (actually
0.2ppb as that is the detection level) and so

$$\text{Estimated sigma} = \frac{90 - 0.2}{6} \approx 14.97, \text{ i.e., roughly } 15$$
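
The Range / 6 rule is simple enough to capture in a few lines. The sketch below is illustrative only; the function name `sigma_from_range` is an assumption of this example, not something defined in the presentation.

```python
def sigma_from_range(maximum: float, minimum: float, divisor: float = 6.0) -> float:
    """Rough sigma estimate: the data range divided by 6 (range taken as +/- 3 sigma)."""
    if maximum < minimum:
        raise ValueError("maximum must be >= minimum")
    return (maximum - minimum) / divisor


# Ferris Field: highest reported value 90 ppb, detection level 0.2 ppb.
ferris_sigma = sigma_from_range(90.0, 0.2)
print(f"estimated sigma = {ferris_sigma:.3f} ppb")
```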


Unimodal,  applying  the  formula 


•  As  we  estimated  sigma  from  the  data  it  is  “s  =  15”  and 
so  we  choose  a  formula  that  depends  on  “s”: 


$$n = \frac{(z_{1-\alpha/2})^2 \, s^2}{d^2} + \tfrac{1}{2} z_{1-\alpha/2}^2$$


• The other formula containing “s” doesn’t apply when
we are doing estimation.

• We can now calculate “n” for various combinations of
the input variables α, s, and d.


Identifying the pieces

$$n = \frac{(z_{1-\alpha/2})^2 \, s^2}{d^2} + \tfrac{1}{2} z_{1-\alpha/2}^2$$

• “n”: the number of random samples needed

• “d”: Within how much (10% of the mean? Which is 5ppb.
20% of the mean? Which is 10ppb)

• “s”: The estimated standard deviation (15 in this case)

• “1 - α/2”: How sure do you want to be (90%, 95%, 97.5%, 99%)
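
As a worked check of the formula, take s = 15 and d = 5 ppb from this example, and assume the tabulated value z = 1.645 (the α/2 = 0.05 case):

$$n = \frac{(1.645)^2 (15)^2}{5^2} + \tfrac{1}{2}(1.645)^2 = \frac{2.706 \times 225}{25} + 1.353 \approx 24.35 + 1.35 = 25.7$$

Rounding up gives 26 samples.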


What  do  we  have? 


For d = 5ppb, here’s the number of samples needed:

s     α/2 = 0.10   α/2 = 0.05   α/2 = 0.025   α/2 = 0.01
15    14           26           37            52
18    22           37           52            73
20    27           45           64            90
22    33           54           77            108
25    42           69           98            139
30    60           99           141           199

Let’s eliminate the unlikely scenarios


The pink scenarios (marked * here) are unlikely to apply to this problem

s      α/2 = 0.10   α/2 = 0.05   α/2 = 0.025   α/2 = 0.01
15     14*          26           37            52*
18     22*          37           52            73*
20     27*          45           64            90*
22     33*          54           77            108*
25*    42*          69*          98*           139*
30*    60*          99*          141*          199*

Why  “unlikely  to  apply  to  this  problem”? 


• The α/2 = 0.01 column

Choosing such a stringent value is rarely applicable in
environmental situations as it is not “life or death”.

• The α/2 = 0.10 column

Choosing such a relaxed value is rarely done except for
controlled exploratory studies. α/2 = 0.10 means that the chance
we miss the true mean is 20% (α = 0.20), which is high.

• The s = 25 through 30 rows

For “s” to be 25 would imply the maximum was 6 x 25 =
150, which is two-thirds as much again as the maximum
we observed (90). As this is a long way past the maximum we
conclude it is unlikely to be this high.

What’s left?


• The sample size needed is between 26 and 77 depending
on our estimate of “s” and choice of “α/2”, assuming that
the allowance in final estimate (d) is 5.

• For “d = 10” (previously “d” = 5) we recalculate and find:

s     α/2 = 0.05   α/2 = 0.025
15    8            11
18    11           15
20    13           18
22    15           21
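
The remaining scenarios can be recomputed for any choice of d. The sketch below assumes the estimation formula with "s", the tabulated z-values 1.645 and 1.96, and rounding up to the next whole sample; the names `n_estimation` and `size_table` are hypothetical helpers introduced for this illustration.

```python
import math

# z-values as tabulated in the presentation, keyed by alpha/2.
Z_TABLE = {0.05: 1.645, 0.025: 1.96}


def n_estimation(z: float, s: float, d: float) -> int:
    """Sample size for estimating a mean: z^2 s^2 / d^2 + z^2 / 2, rounded up."""
    return math.ceil((z * s / d) ** 2 + 0.5 * z**2)


def size_table(d: float, s_values=(15, 18, 20, 22)) -> dict:
    """Map each s to a tuple of sample sizes, one per alpha/2 entry in Z_TABLE."""
    return {s: tuple(n_estimation(z, s, d) for z in Z_TABLE.values()) for s in s_values}


# Recompute the d = 10 scenarios.
for s, row in size_table(d=10).items():
    print(s, row)
```

Calling `size_table(d=5)` in the same way gives the surviving cells of the earlier d = 5 table.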

• If we assume it is unlikely the budget for the pesticide
investigation will allow for more than 30 samples, the
conclusion is:


What  to  report 


If  you  want  the  estimate  to  be  +/- 10%  (i.e.,  5ppb) 

Take  30  samples: 

•  You’d  be  pretty  sure  the  estimate  is  good  (roughly  95%  confidence) 

•  If  the  assumptions  were  off  just  a  little  then  you  couldn’t  be  sure  as  you 
haven’t  enough  samples. 

Take  20  samples: 

• Even if the assumptions were true you couldn’t be sure as you haven’t taken
enough samples.

If you want the estimate to be +/- 20% (i.e., 10ppb)

Take 30 samples:

• Very sure the estimate is good (exceeds 95% confidence) even if the
assumptions were not true. Probably should make “d” smaller otherwise
more than enough samples have been taken.

Take 20 samples:

• Pretty sure the estimate is good even if the assumptions are not true
(95% confidence)

Applying to Ferris Field


Ferris  Field  was  200  x  150  yards,  equal  to  270,000  sq  ft. 

If  we  take  30  random  samples,  each  sample  would  have 
to  represent  9000  sq  ft  equal  to  about  32  x  30  yards.  Is  it 
feasible  that  a  single  sample  could  be  representative  of 
such  a  large  area?  (13,500  sq  ft  if  20  samples  are  taken) 

The formula assumes a random sample. Could we locate
random locations at Ferris Field? Do we have access to GPS?

Can  we  cut  down  on  the  number  of  samples  sent  for 
analysis?  Could  we  use  field  sampling  techniques?  Can 
we  change  sampling  schemes?  Are  there  alternatives? 
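
To make the questions above concrete, the following sketch computes the area each sample must represent and draws simple random locations over an idealized 200 x 150 yard rectangle. The rectangular coordinate system, the fixed seed, and the function names are assumptions made for illustration; a real field would need surveyed boundaries.

```python
import random

FIELD_LENGTH_YD = 200.0  # dimensions quoted for Ferris Field
FIELD_WIDTH_YD = 150.0
SQ_FT_PER_SQ_YD = 9.0


def area_per_sample_sq_ft(n_samples: int) -> float:
    """Area (sq ft) that each sample must represent if the field is shared equally."""
    return FIELD_LENGTH_YD * FIELD_WIDTH_YD * SQ_FT_PER_SQ_YD / n_samples


def random_locations(n_samples: int, seed: int = 0) -> list:
    """Draw n_samples uniformly random (x, y) positions in yards within the rectangle."""
    rng = random.Random(seed)
    return [
        (rng.uniform(0.0, FIELD_LENGTH_YD), rng.uniform(0.0, FIELD_WIDTH_YD))
        for _ in range(n_samples)
    ]


print(area_per_sample_sq_ft(30), area_per_sample_sq_ft(20))
for x, y in random_locations(5):
    print(f"x = {x:6.1f} yd, y = {y:6.1f} yd")
```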


Example: Tasiburn Road

Tasiburn Road Development Project


• Tasiburn Road is a 2-lane State-maintained road in
western Arkabama. Prior to being made SR 409 it was a
dirt road with easy access on both sides.

•  About  20  years  ago  waste  oil  was  sprayed  on  dirt  roads 
as  a  dust  suppressor  -  the  oil  contained  Glomerator. 

•  Glomerator  is  now  recognized  as  a  contaminant  that  is 
a  cause  for  concern. 

•  The  project  developer  must  determine  the  level  of 
Glomerator  at  the  site  before  taking  subsequent  action. 




Tasiburn  Road  data  and  information 


•  The  Tasiburn  Road  Development  is  bounded  by  Cuyagua 
Creek  on  the  west,  a  disused  gasoline  station  to  the  north, 
SR  409  to  the  east  and  a  fence  to  the  south. 

•  Traffic  density  is  fairly  light  and  automobile  pollution  low. 

•  There  may  be  some  leakage  from  underground  storage 
tanks  remaining  at  the  disused  gasoline  station. 

•  Cuyagua  Creek  is  shallow,  average  depth  12”  but  subject  to 
spring  flooding  on  an  irregular  basis 

• Core samples from 6 locations gave Glomerator readings:

21ppm, 22ppm, 25ppm, 26ppm (2), and 36ppm (mean =
26ppm, variance = 28.4ppm², standard deviation = 5.329ppm).
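
The summary statistics quoted for the six core samples can be reproduced with the Python standard library. This is only a checking sketch; the variable names are chosen for illustration.

```python
import statistics

# Glomerator readings (ppm) from the six core sample locations; 26 ppm occurs twice.
readings_ppm = [21, 22, 25, 26, 26, 36]

mean_ppm = statistics.mean(readings_ppm)
variance_ppm2 = statistics.variance(readings_ppm)  # sample variance (divides by n - 1)
std_dev_ppm = statistics.stdev(readings_ppm)       # square root of the sample variance

print(f"mean = {mean_ppm}, variance = {variance_ppm2:.1f}, std dev = {std_dev_ppm:.3f}")
```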
What the developer must achieve


•  The  developer  must  convince  the  State  Regulator  that 
the  estimated  mean  level  of  Glomerator  remaining  at  the 
project  is  accurate  to  within  2  ppm. 

• Also present evidence that the sample results have
credibility and will be accepted by the State Regulator.

•  Show  that  a  reasonable  (defensible)  argument  may  be 
made  that  sufficient  samples  to  characterize  the  project 
have  been  taken. 

•  Show  that  the  estimated  mean  meets  commonly  held 
standards of certainty.

How many samples does the
developer  need? 


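From one standard textbook:

$$n = \frac{(z_{1-\alpha/2})^2 \, \sigma^2}{d^2}$$

From one EPA document:

$$n = \frac{(z_{1-\alpha} + z_{1-\beta})^2 \, \sigma^2}{d^2}$$

From another document:

$$n = \frac{(z_{1-\alpha} + z_{1-\beta})^2 \, s^2}{d^2} + \tfrac{1}{2} z_{1-\alpha}^2$$

From yet another:

$$n = \frac{(z_{1-\alpha/2})^2 \, s^2}{d^2} + \tfrac{1}{2} z_{1-\alpha/2}^2$$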

Make  a  recommendation  and  suggest 
the number of samples needed.

Choosing the Right Formula
(Decision  Making) 


Example:  Field  sprayed  with  pesticide 


•  Ferris  Field  is  almost  rectangular,  roughly  200  x  150  yds 
(270,000  square  feet,  about  6  acres  or  roughly  20  houses). 

•  Several  years  ago  it  was  sprayed  with  a  pesticide  now 
regarded  as  hazardous.  We  need  to  determine  if  the 
average  level  of  pesticide  present  in  the  top  6"  of  soil  at 
Ferris  Field  exceeds  the  long  term  exposure  level  of  50ppb. 

• We need to take sufficient samples such that if the true
level of pesticide is 60ppb or more, we would be 90% sure
we could find this.

• How many samples should we take?

Almost the same problem as previously
encountered  but  important  differences 


Previously: 

“We  need  to  estimate  the  average  level...” 

Now: 

“We  need  to  determine  if  the  average  level  of  pesticide... 
exceeds  the  long  term  exposure  level  of  50ppb” 

Now: 

“We  need  to  take  sufficient  samples  such  that  if  the  true  level 
of  pesticide  is  60ppb  or  more,  we  would  be  90%  sure  we 
could determine this”

We will use the same assumptions


•  Previously,  by  using  the  assumption  the  data  was 
unimodal,  we  reached  the  conclusion  that  an  estimate  of 
sigma  was  15. 

•  We  then  used  a  formula  to  find  “n”  (we  concluded  that  it 
was  20  -  30  depending  on  which  one  the  budget  would 
allow). 

•  For  decision  making  (“determine  if”)  with  criteria  for  how 
certain  we  need  to  be  (“would  be  90%  sure”),  we  should 
choose  an  appropriate  formula. 

• It will be similar to, but more complex than, the one used
for estimation.

Unimodal, applying the formula


•  As  we  estimated  sigma  from  the  data  it  is  “s  =  15”  and 
so  we  choose  a  formula  that  depends  on  “s”: 


$$n = \frac{(z_{1-\alpha} + z_{1-\beta})^2 \, s^2}{d^2} + \tfrac{1}{2} z_{1-\alpha}^2$$


• But what is z_{1-β}?

• It is related to the statistical power of the test when the
Alternative Hypothesis is true

• Need to talk about Null and Alternative Hypotheses

Null and Alternative Hypotheses


• The Null Hypothesis (H0) represents the “baseline”
condition. We hold on to the Null until faced with
overwhelming evidence (data) that shows it can’t be true,
in which case we choose the Alternative Hypothesis (HA).

• In this specific case:

• H0: The average (mean) level ≤ 50ppb

• HA: The average (mean) level > 50ppb

• Note that we do not specify what exactly it is under the
Alternative, just that it exceeds what the Null says it is.

Null, Alternative, α, and β


• The error rates, α and β, are defined as:

• “α” is the probability of rejecting the Null Hypothesis
when it is really true.

• “β” is the probability of accepting the Null Hypothesis
when it is really false.

• In this specific case:

“α” is the chance you say the mean level
is above 50ppb when really it is below 50ppb.

“β” is the chance you say the mean level
is below 50ppb when really it’s above 50ppb.

Null, Alternative, α, and β


• “α” is decided at the very start, often set at 10% or 5%.

• “β” is specified for certain values for HA and then the
number of samples needed to achieve this is calculated.

• For our example, “α” was not specified, so we chose 5%.

• For our example, “β” was specified, but indirectly:

“if the true level was 60ppb or more, we would like
to be 90% sure we could detect this”.

90% sure we’re right equals 10% chance that we’re
wrong.

In math terms:

1 - β = 0.90, therefore β = 0.10

How z_{1-α} and z_{1-β} are related


1 - β is the area under HA from 60


Identifying the pieces

$$n = \frac{(z_{1-\alpha} + z_{1-\beta})^2 \, s^2}{d^2} + \tfrac{1}{2} z_{1-\alpha}^2$$

• “n”: the number of random samples needed

• “d”: Within how much (difference between H0 = 50ppb, and
the specified HA = 60ppb)

• “s”: The estimated standard deviation (15 in this case)

• “1 - α”: How sure do you want to be (90% and also 95%)

• “1 - β”: Related to statistical power (90%, z_{1-β} = 1.28 for 90%)
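
The pieces listed above can be combined in a short calculation. This sketch assumes the tabulated z-values (1.28 for 10%, 1.645 for 5%) and rounds up to the next whole sample; the function name `n_decision` is introduced here for illustration.

```python
import math


def n_decision(z_alpha: float, z_beta: float, s: float, d: float) -> int:
    """Sample size for decision making: (z_a + z_b)^2 s^2 / d^2 + z_a^2 / 2, rounded up."""
    return math.ceil(((z_alpha + z_beta) * s / d) ** 2 + 0.5 * z_alpha**2)


Z_BETA = 1.28  # power of 90%, i.e. beta = 0.10
D_PPB = 10.0   # difference between H0 (50 ppb) and the specified HA (60 ppb)

for s in (15, 18, 20, 22):
    n_alpha_10 = n_decision(1.28, Z_BETA, s, D_PPB)   # alpha = 0.10
    n_alpha_05 = n_decision(1.645, Z_BETA, s, D_PPB)  # alpha = 0.05
    print(f"s = {s}: n = {n_alpha_10} (alpha 0.10), n = {n_alpha_05} (alpha 0.05)")
```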
What do we have?


For d = 10ppb, here’s the number of samples needed:

s     α = 0.10   α = 0.05
15    16         21
18    23         30
20    28         36
22    33         43

This is for β = 10%

Comparison of number of samples:
estimation  versus  decision  making 


For d = 10ppb:

Estimation

s     α/2 = 0.05   α/2 = 0.025
15    8            11
18    11           15
20    13           18
22    15           21

Decision making

s     α = 0.10   α = 0.05
15    16         21
18    23         30
20    28         36
22    33         43

This is for β = 10%


Note that decision making needs more samples so

What to report


If  you  took  30  samples: 

•  You’d  be  quite  sure  (over  95%  confidence)  that  enough  samples 
have  been  collected  unless  the  assumptions  are  well  off  in  which 
case  you’d  be  unsure  as  not  enough  samples  have  been  taken. 

If  you  took  20  samples: 

•  You’d  be  pretty  sure  (95%  confidence)  that  enough  samples  have 
been  taken  if  the  assumptions  are  true.  If  the  assumptions  are 
not  true  then  you’d  be  unsure  as  not  enough  samples  have  been 
taken. 


Applying  to  Ferris  Field 


Ferris  Field  was  200  x  150  yards,  equal  to  270,000  sq  ft. 

If  we  take  30  random  samples,  each  sample  would  have 
to  represent  9000  sq  ft  equal  to  about  32  x  30  yards.  Is  it 
feasible  that  a  single  sample  could  be  representative  of 
such  a  large  area?  (13,500  sq  ft  if  20  samples  are  taken) 

The formula assumes a random sample. Could we locate
random locations at Ferris Field? Do we have access to GPS?

Can  we  cut  down  on  the  number  of  samples  sent  for 
analysis?  Could  we  use  field  sampling  techniques?  Can 
we change sampling schemes? Are there alternatives?

Did I Take Enough Samples?


You  don’t  know 

•  You  calculated  the  number  of  samples  needed,  took  the 
required  samples  (including  those  for  Quality  Assurance 
controls),  but  are  you  sure  you  took  enough? 

•  You  don’t  (and  may  never)  know. 

•  Could  you  use  the  results  from  the  sample  to  answer 
this? 

• The answer is “Yes, maybe, if you’re lucky...”

“Back-estimating” number of samples


•  We  take  the  sample  standard  deviation  and  use  the 
formula  to  calculate  “backwards”  to  get  “n”.  Let’s  use 
the  example  for  decision  making: 

$$n = \frac{(z_{1-\alpha} + z_{1-\beta})^2 \, s^2}{d^2} + \tfrac{1}{2} z_{1-\alpha}^2$$

•  We  agreed  on  30  samples  (before  QA  additional 
samples)  based  on  a  sigma  of  15  (actually  we  noted  it 
could  have  been  as  high  as  18).  Suppose  we  did  this 
and  the  standard  deviation  of  the  sample  was  19.35. 

•  Using  the  formula  “backwards”  yields  “n”  =  34 


What  to  conclude 


• Well, using the sample standard deviation we should
have taken 34 instead of 30. Probably the 30 we took is
enough; a difference of 4 is not important.

•  But,  it  can  be  shown  that  a  confidence  interval  for  the 
unknown  standard  deviation  is  approximately  between 
15.41  and  26.01. 

•  Using  15.41  in  the  formula  gives  “n”  =  22 

•  Using  26.01  in  the  formula  gives  “n”  =  60 

•  So,  not  too  much  help;  best  advice  is  to  calculate, 
document,  back-estimate,  and  use  common-sense. 
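
The back-estimate and the interval for the standard deviation can be sketched as follows. The interval quoted above matches the usual chi-square interval for a standard deviation at 95% confidence with n = 30; that reading, the chi-square quantiles (taken from standard tables for 29 degrees of freedom), and the helper names are assumptions of this sketch, as are α = 0.05, β = 0.10, and d = 10.

```python
import math

# Chi-square quantiles for 29 degrees of freedom, from standard tables (assumed here).
CHI2_29_LOWER = 16.047  # 2.5th percentile
CHI2_29_UPPER = 45.722  # 97.5th percentile


def n_decision(z_alpha: float, z_beta: float, s: float, d: float) -> int:
    """Decision-making sample size, rounded up to a whole sample."""
    return math.ceil(((z_alpha + z_beta) * s / d) ** 2 + 0.5 * z_alpha**2)


def sigma_interval(s: float, n: int, chi2_lower: float, chi2_upper: float) -> tuple:
    """Confidence interval for sigma from a sample standard deviation s of size n."""
    df = n - 1
    return (s * math.sqrt(df / chi2_upper), s * math.sqrt(df / chi2_lower))


s_observed = 19.35
low, high = sigma_interval(s_observed, 30, CHI2_29_LOWER, CHI2_29_UPPER)

# Back-estimate n at the observed s and at each end of the interval.
for label, s in (("observed", s_observed), ("lower", low), ("upper", high)):
    print(f"{label:<8} s = {s:5.2f}  n = {n_decision(1.645, 1.28, s, 10.0)}")
```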
Reducing Sample Size by using
Stratification

Sampling & Field Variability


Total  variability  of  any  project  has  two  parts: 

Field  Variability  +  Measurement  variability 

Field variability exceeds Measurement variability

We  can  reduce  the  number  of  samples  by  reducing 
Field  variability.  Stratification  makes  more  homogeneous 
areas  within  the  entire  problem  and  we  can  find  the 
required  number  of  samples  for  each  area  and  then 
combine for an overall total.

Simple Random Sampling

•  Used  when  population  variability  is  high 

•  Simple  in  concept  and  provides  proper  data 
(theoretical  support)  for  statistical  data  analysis 

•  Is  the  basic  building  block  of  more  complicated  (but 
more  effective)  probability-based  designs 

•  Difficult  to  find  individual  sample  locations 

•  Often  demands  a  large  number  of  samples 


Stratified  Sampling 


• The target population is divided into contiguous sub-populations
(strata) of approximately the same variability

• Sampling locations are selected within each stratum using some
sampling design

•  Needs  information  on  what  criteria  are  meaningful  in  defining 
the  strata  boundaries 

• Greatly reduces variability

Grid (Systematic) Sampling


•  Collecting  samples  according  to  a  specified  pattern  at 
regular  intervals 

•  Can  yield  more  precise  estimates  of  population 
parameters  than  other  sampling  designs 

•  Easy  to  explain  and  implement  and  provides  uniform 
coverage  of  site  or  project 

•  Can  be  biased  if  the  sampling  grid  pattern  or  the 
regular  frequency  of  taking  samples  coincides  with  any 
pattern of contamination

Example: Littlewood Site


•  Site  is  suspected  to  be  contaminated  with  arsenic  through  its 
production  work 


• The site was a multi-purpose factory

Project objective


•  The  project  objective  is  to  estimate  the  mean  arsenic  level 
(of  the  site  and  the  individual  areas)  with  minimal 
uncertainty. 

•  Uncertainty  can  be  measured  by  precision  or  variability  in 
the  conclusion. 

•  The  number  of  samples  needed  will  be  proportional  to  the 
variability  encountered,  high  variability  demanding  more 
samples  than  low  variability 

•  We  are  just  interested  in  the  field  aspect  -  not  the 
measurement as this is the responsibility of others

The size of the project


•  The  map  is  drawn  to  scale  and  so  it  may  be  deduced 
that  the  entire  site  is  about  6000  sq  yds 

• For clean-up purposes the site will be divided into
operational units of size 1500 sq yds (roughly the size of
a football field divided by 4)

• Therefore there are 4 operational units (total) in this
site to be investigated through sampling

Preliminary observations


A very rough estimate of the total variability can be
deduced from the range in data values:

$$\text{estimated } s = \frac{\text{range}}{6} = \frac{33.0 - 0.2}{6} \approx 5.5$$

The estimated total variance (s²) is then (5.5)², which
is 30.25 and approximately equal to 30

The sample size per operational unit


$$n = \frac{(z_{1-\alpha/2})^2 \, s^2}{d^2} + \tfrac{1}{2} z_{1-\alpha/2}^2$$

z_{1-α/2} = how sure we need to be (95%)
d = how close we need to be (+/- 2ppm)
n = the formula gives 21.6

i.e., We need 22 per operational unit together with around 4
QC samples to estimate other components

The cost for the project

• Total samples = 4 x (22 + 4) = 104

• Square grid across the site

• Cost to analyze each sample $1,000

• Total cost is then $104,000

• But the budget is only $35,000

More work is needed as clearly we can
only afford a maximum of 35 samples
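
The per-unit sample size and the cost comparison can be reproduced in a few lines. The sketch assumes z = 1.645 for the 95% level (this value reproduces the 21.6 quoted earlier) and the rounded variance of 30; the variable names are illustrative.

```python
import math

Z_95 = 1.645           # assumed z-value for "95%"; it reproduces n = 21.6
VARIANCE = 30.0        # estimated total variance (ppm^2), rounded from 30.25
D_PPM = 2.0            # how close the estimate needs to be
QC_PER_UNIT = 4        # QC samples per operational unit
N_UNITS = 4            # operational units on the site
COST_PER_SAMPLE = 1_000
BUDGET = 35_000

n_raw = Z_95**2 * VARIANCE / D_PPM**2 + 0.5 * Z_95**2
n_per_unit = math.ceil(n_raw)
total_samples = N_UNITS * (n_per_unit + QC_PER_UNIT)
total_cost = total_samples * COST_PER_SAMPLE
affordable_samples = BUDGET // COST_PER_SAMPLE

print(f"n per unit = {n_raw:.1f} -> {n_per_unit}")
print(f"total samples = {total_samples}, cost = ${total_cost:,}")
print(f"budget covers {affordable_samples} samples")
```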
Can reduce variability through stratifying


• Records show that the site contains two distinct areas: a
rectangular area where production was conducted and a
much larger area containing administration offices and a
parking lot.


Contamination in the large
area is believed to be relatively low

You are improving the conceptual
model of operations at the site

Impact of stratification on sample size

•  With  reduced  variability,  fewer  samples  are  needed 
to  achieve  the  same  precision 

•  The  high  value  (33ppm)  most  likely  came  from  the 
Production  Area  so  the  variability  estimate  is 
probably  valid  in  that  region 

•  Production  Area  (1430  sq  yds)  is  approximately  the 
same  as  an  operational  unit  (1500  sq  yds) 

•  This  would  require  26  samples  from  the  Production 
Area  (recall  the  formula  does  not  consider  target 
population size)

Using the same parameters as before


•  Due  to  budget  constraints  this  would  leave  only  9  samples 
to  be  taken  from  the  Admin/Parking  area 

•  However,  does  the  Admin/Parking  area  have  the  same 
variability  as  the  Production  area?  Probably  not... 

• Suppose the maximum expected in Admin/Parking was 10ppm;
using the formula would demand only 4.3 (5 plus QC samples)
and so we would be within budget.

Conclusion for Littlewood

Need  22  +  4  =  26  samples  for  Production  Area 

Need  5  +  1  =  6  samples  for  Parking/Admin 

As  the  budget  allowed  for  35  total,  the  remaining 
3  could  be  allocated  to  the  Production  Area,  thus 
improving  the  quality  of  the  resulting  estimates. 
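
The allocation arithmetic for the two strata can be written out as a small check. The dictionary layout and names below are purely illustrative; the counts are the ones given in the conclusion above.

```python
BUDGET_SAMPLES = 35  # maximum number of samples the budget allows

# Field samples plus QC samples required in each stratum.
strata = {
    "Production Area": {"field": 22, "qc": 4},
    "Parking/Admin": {"field": 5, "qc": 1},
}

required = {name: counts["field"] + counts["qc"] for name, counts in strata.items()}
spare = BUDGET_SAMPLES - sum(required.values())

print(required)
print(f"spare samples available for the Production Area: {spare}")
```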


Overall  Conclusion 

•  Formulae  found  in  books  assume  simple  random  sampling 

•  Stratification  can  reduce  costs  and  improve  precision 

•  More  complex  sampling  designs  can  produce  better  estimates 
than simple designs but need more planning

How Many Samples Are Needed


Conclusions 


•  The  number  of  samples  needed  is  dependent  on: 

•  The  purpose  of  the  project 

• d → how close the estimate needs to be

• σ² → the variance of the population of interest

• s² → the variance of a sample

• z → expected distribution

• α → significance level and certainty

• β → statistical power and certainty

•  It  is  important  to  confirm  after  sampling  that  you  have  a 
sufficient  number  of  samples  to  meet  your  objectives. 

•  The  number  of  samples  necessary  can  be  greatly  reduced 
with clever sampling designs.

Want more information?


•  Guidance  on  Choosing  a  Sampling  Design  for 
Environmental  Data  Collection  QA/G-5S 

(www.epa.gov/quality) 

•  Statistical  Methods  for  Environmental  Pollution 
Monitoring 

Richard  O.  Gilbert 

•  Sampling 

Steven K. Thompson

Still need more help?


Kelly  Black 
Neptune  &  Company 
kblack@neptuneinc.org 


John  Warren 

Office  of  Environmental  Information,  EPA 
warren.john@epa.gov